Introduction

How much homophony should we expect for a given wordform, as a function of its phonotactics and word length? Is this expected homophony more or less than the observed homophony for that wordform? And critically: does this Homophony Delta vary across wordforms as a function of Wordform Frequency?

That is, if there is a pressure against homophony in real lexica (Trott & Bergen, 2020), that pressure might manifest unevenly across the lexicon. Based on previous work, it is possible that the pressure against homophony is actually weakest for long wordforms and for phonotactically implausible wordforms. Why would this be the case? It seems counterintuitive: a lexicon designed for efficiency should optimize the use of its wordforms by concentrating homophones among the shortest, most phonotactically plausible wordforms.

One explanation is that word length is correlated with frequency. A negative selection pressure might be strongest for wordforms that, if ambiguous, would require very frequent disambiguation. If this interpretation is correct, Homophony Delta should correlate negatively with Frequency.

Again, this prediction appears counterintuitive, given the extensive evidence that wordform frequency is positively correlated with ambiguity, whether measured in terms of number of homophones or number of senses (Zipf, 1945; Piantadosi et al., 2012). But this observation would ultimately still be consistent with a mechanism selecting against homophones in the most frequent wordforms. The point is not that frequent wordforms should have fewer homophones in absolute terms; rather, frequent wordforms should have fewer homophones than one would expect given their phonotactics and length, and this gap, the Homophony Delta, should be smaller (or even inverted) for infrequent wordforms.

Calculating expected homophony

Given \(M\) meanings to express, and \(W\) wordforms with which to express them, each associated with some probability \(p_i\), how many meanings should each wordform \(w_i\) carry as a function of its normalized probability \(p_i\)?
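Under the simplest allocation assumption, in which meanings are assigned to wordforms in proportion to their probabilities, the expected number of entries carried by \(w_i\) is \(p_i M\), so its expected number of homophones (entries other than its own) is \(k_i = p_i M - 1\). A minimal sketch of this computation in base R, on made-up numbers:

```r
# Hypothetical toy lexicon: three wordforms of one word length, with
# normalized probabilities p_i, and M = 10 meanings to distribute.
p <- c(0.5, 0.3, 0.2)  # made-up normalized probabilities (sum to 1)
M <- 10                # meanings available at this word length

# Expected entries per wordform are p_i * M; subtracting the wordform's
# own entry gives the expected number of homophones k_i.
k <- p * M - 1
k            # 4 2 1

# Summing k + 1 over wordforms recovers M, the check used later on.
sum(k + 1)   # 10
```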

Helper functions

We also define several helper functions to simplify the data processing pipeline.

First, we need a function that reads in an aggregated lexicon (i.e., one in which homophonous wordforms have been collapsed into a single row); this function then reconstructs the correct number of meanings \(M\) per word length from the homophone counts, and merges that information back into the aggregated lexicon.

read_data = function(agg_path) {
  
  # Read in aggregated data
  df_agg = read_csv(agg_path)
  
  # Below, we compute the total number of lexical entries (i.e., meanings M) at each word length.
  df_total_counts = df_agg %>%
    mutate(k_actual = num_homophones + 1) %>%
    group_by(num_sylls_est) %>%
    summarise(M = sum(k_actual))
  
  # Merge with counts
  df_agg = merge(df_agg, df_total_counts, by = "num_sylls_est")
  
  df_agg
}

We then define a function to normalize the probability of a given wordform relative to the sum of probabilities of wordforms for that length.

normalize_probabilities = function(df) {
  ### For a given dataframe, normalize probabilities "p" relative to the summed probability of words of a given length.
  
  ## First get probabilities
  df = df %>%
    mutate(normalized_surprisal = heldout_surprisal / num_phones,
           p = 10 ^ heldout_log_prob)
  
  # Calculate sum(p) for each syllable length bucket.
  df_syl_sums = df %>%
    group_by(num_sylls_est) %>%
    summarise(total_prob = sum(p))
  
  # Merge with aggregated data, and normalize
  df = df %>%
    left_join(df_syl_sums, by = "num_sylls_est") %>%
    mutate(p_normalized = p / total_prob)
  
  df
}
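As a quick illustration of what this pipeline does within a single length bucket, here is the same normalization in base R on made-up values (assuming, as the code above does, that heldout_log_prob is a base-10 log probability):

```r
# Made-up base-10 holdout log probabilities for three same-length wordforms.
heldout_log_prob <- c(-2, -3, -3)

p <- 10 ^ heldout_log_prob   # back-transform to raw probabilities
p_normalized <- p / sum(p)   # normalize within the length bucket

sum(p_normalized)            # 1
```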

We now define a function to compute the expected number of homophones \(k_i\) for a given wordform \(w_i\). This will also calculate the delta between the real and expected amount.

compute_expected_homophones = function(df) {
  
  ## Compute expected homophones "k" of a wordform by multiplying normalized probability
  ## "p" by the number of meanings "M".
  
  df = df %>%
    # Expected number of "successes" (entries), minus 1
    mutate(k = p_normalized * M - 1) %>%
    mutate(delta = num_homophones - k)
  
  df
}

We also define a function to run the main regression:

run_regression = function(df) {
  
  # Initialize output
  out = list()
  
  # Build full model
  model_full = lm(data = df,
                  delta ~ frequency + num_sylls_est + normalized_surprisal)

  # Build reduced model
  model_reduced = lm(data = df,
                     delta ~ num_sylls_est + normalized_surprisal)
  
  # Run model comparison
  comparison = anova(model_reduced, model_full)
  df_comp = broom::tidy(comparison) %>%
    na.omit()
  
  # Tidy model output
  df_model = broom::tidy(model_full)
  
  # Add to list parameters
  out$model = df_model
  out$comparison = df_comp
  
  out
}
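The comparison above is a standard nested-model F-test. A self-contained sketch on simulated data (all variable names and values here are hypothetical) illustrates the pattern:

```r
# Simulate data in which x2 carries no signal, then compare nested models.
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
y  <- 2 * x1 + rnorm(200)

model_full    <- lm(y ~ x1 + x2)
model_reduced <- lm(y ~ x1)

# F-test for whether adding x2 significantly reduces residual variance.
anova(model_reduced, model_full)
```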

And several plotting functions:

plot_outcome = function(df_output) {
  
  df_output$model$predictor = fct_recode(
    df_output$model$term,
    "Number of Syllables" = "num_sylls_est",
    "Normalized Surprisal" = "normalized_surprisal",
    "Log(Frequency)" = "frequency"
  )
  
  ### Plots outcome of regression
  g = df_output$model %>%
    ggplot(aes(x = predictor,
               y = estimate)) +
    geom_point() +
    coord_flip() +
    geom_hline(yintercept = 0, linetype = "dotted") +
    geom_errorbar(aes(ymin = estimate - 2*std.error, 
                      ymax = estimate + 2*std.error), 
                  width=.2,
                  position=position_dodge(.9)) +
    labs(x = "Predictor",
         y = "Estimate") +
    theme_minimal()
  
  g
}

plot_comparison = function(df) {
  
  # Plots expected vs. actual number of homophones for each wordform.
  g = df %>%
    ggplot(aes(x = k,
               y = num_homophones)) +
    geom_point(alpha = .5) +
    scale_x_continuous(limits = c(-1, max(df$k))) +
    scale_y_continuous(limits = c(0, max(df$k))) +
    geom_abline(intercept = 0, slope = 1, linetype = "dotted") + 
    labs(x = "Expected number of homophones",
         y = "Actual number of homophones") +
    theme_minimal()
  
  g
}


plot_bins = function(df, n, column, label) {
  
  # Plots mean delta across bins of the specified column.
  
  df$binned = as.numeric(ntile(pull(df, column), n = n))
  
  g = df %>%
    group_by(binned) %>%
    summarise(mean_delta = mean(delta),
              se_delta = sd(delta) / sqrt(n())) %>%
    ggplot(aes(x = binned,
               y = mean_delta)) +
    geom_point(size = .2) +
    geom_hline(yintercept = 0, linetype = "dotted") +
    geom_errorbar(aes(ymin = mean_delta - se_delta, 
                      ymax = mean_delta + se_delta), 
                  width=.2,
                  position=position_dodge(.9)) +
    labs(x = label,
         y = "Delta (Real - Expected)") +
    geom_smooth() +
    theme_minimal()

    g
}


plot_contrast = function(df, bins = 20) {
  
  df$frequency_binned = as.numeric(ntile(df$frequency, n = bins))
  
  g = df %>%
    mutate(expected = k,
           actual = num_homophones) %>%
    pivot_longer(c(expected, actual), names_to = "model", values_to = "homophones") %>%
    ggplot(aes(x = frequency_binned,
               y = homophones,
               color = model)) +
    stat_summary(fun = mean,
                 fun.min = function(x) mean(x) - sd(x) / sqrt(length(x)),
                 fun.max = function(x) mean(x) + sd(x) / sqrt(length(x)),
                 geom = "pointrange") +
    labs(x = "Binned Frequency",
         y = "Number of Homophones") +
    theme_bw()
  
  g
}

Homophony Delta ~ Frequency

English

First, we load and process the data.

df_english = read_data("../data/processed/english/reals/english_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   `Unnamed: 0` = col_double(),
##   index = col_double(),
##   Word = col_character(),
##   CobLog = col_double(),
##   CompCnt = col_double(),
##   PhonDISC = col_character(),
##   Class = col_character(),
##   SylCnt = col_double(),
##   num_phones = col_double(),
##   num_sylls_est = col_double(),
##   remove = col_logical(),
##   num_homophones = col_double(),
##   log_prob = col_double(),
##   surprisal = col_double(),
##   rank_num_homophones = col_double(),
##   neighborhood_size = col_double(),
##   heldout_log_prob = col_double(),
##   heldout_surprisal = col_double()
## )
nrow(df_english)
## [1] 35107
## Now normalize probabilities and compute expected homophones per wordform
df_english = df_english %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_english %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## # A tibble: 9 × 3
##   num_sylls_est total correct_total
##           <dbl> <dbl>         <dbl>
## 1             1  7706          7706
## 2             2 15247         15247
## 3             3 11379         11379
## 4             4  5316          5316
## 5             5  1694          1694
## 6             6   439           439
## 7             7    95            95
## 8             8    10            10
## 9            10     1             1

We then merge with frequency data for each individual lemma. (We also calculate entropy over senses in the process.)
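Concretely, for a wordform \(w\) whose lemmas have relative frequencies \(p_1, \dots, p_n\) (each lemma's frequency divided by the wordform's total frequency), sense entropy is

\[
H(w) = -\sum_{j=1}^{n} p_j \log p_j,
\]

using the natural logarithm, as in the summarise call below.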

df_english_lemmas = read_delim("../data/frequency/english/lemmas.csv", delim = "\\",
                               quote = "")
## Parsed with column specification:
## cols(
##   PhonDISC = col_character(),
##   Cob = col_double()
## )
nrow(df_english_lemmas)
## [1] 52447
df_english_lemmas = df_english_lemmas %>%
  mutate(freq_adjusted = Cob + 1) %>%
  group_by(PhonDISC) %>%
  mutate(total_frequency = sum(freq_adjusted),
         relative_frequency = freq_adjusted / total_frequency)

df_entropy = df_english_lemmas %>%
  group_by(PhonDISC) %>%
  summarise(entropy = -sum(relative_frequency * log(relative_frequency)),
            total_frequency = mean(total_frequency))

df_merged_english = df_english %>%
  inner_join(df_entropy, by = "PhonDISC")

nrow(df_merged_english)
## [1] 28914
df_merged_english$frequency = log10(df_merged_english$total_frequency)

Now we can visualize the outcome in a variety of ways:

df_merged_english %>%
  plot_comparison()

df_merged_english %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_merged_english %>% 
  run_regression()

output %>%
  plot_outcome()

output$model
## # A tibble: 4 × 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -6.54     0.362     -18.1  1.43e- 72
## 2 frequency              -0.494    0.0650     -7.59 3.22e- 14
## 3 num_sylls_est           0.727    0.0730      9.96 2.50e- 23
## 4 normalized_surprisal    3.58     0.131      27.3  7.43e-162
output$comparison
## # A tibble: 1 × 6
##   res.df      rss    df sumsq statistic  p.value
##    <dbl>    <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1  28910 2742636.     1 5469.      57.7 3.22e-14

Directly contrast the relationships:

df_merged_english %>%
  plot_contrast(bins = 20)

Calculate Sense Entropy

nrow(filter(df_merged_english, num_homophones == 1))
## [1] 4149
model_full = lm(data = filter(df_merged_english, num_homophones == 1),
                delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)

summary(model_full)
## 
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal, 
##     data = filter(df_merged_english, num_homophones == 1))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -246.606   -0.603    1.029    2.554    6.610 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -7.4331     0.9999  -7.434 1.27e-13 ***
## entropy                1.1894     0.6551   1.816 0.069501 .  
## frequency             -0.9682     0.1731  -5.594 2.36e-08 ***
## num_sylls_est          0.7544     0.1955   3.860 0.000115 ***
## normalized_surprisal   4.6364     0.3064  15.133  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.805 on 4144 degrees of freedom
## Multiple R-squared:  0.07287,    Adjusted R-squared:  0.07197 
## F-statistic: 81.42 on 4 and 4144 DF,  p-value: < 2.2e-16
model_reduced = lm(data = filter(df_merged_english, num_homophones ==1),
                delta ~ frequency + num_sylls_est + normalized_surprisal)

anova(model_reduced, model_full)
## Analysis of Variance Table
## 
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)  
## 1   4145 321539                             
## 2   4144 321283  1    255.57 3.2965 0.0695 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Dutch

First, we load and process the data.

df_dutch = read_data("../data/processed/dutch/reals/dutch_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Word = col_character(),
##   PhonStrsDISC = col_character(),
##   PhonCVBr = col_character(),
##   PhonSylBCLX = col_character(),
##   PhonStrsStDISC = col_character(),
##   PhonStCVBr = col_character(),
##   PhonSylStBCLX = col_character(),
##   PhonolCLX = col_character(),
##   PhonolCPA = col_character(),
##   PhonDISC = col_character(),
##   remove = col_logical()
## )
## See spec(...) for full column specifications.
nrow(df_dutch)
## [1] 65260
## Now normalize probabilities and compute expected homophones per wordform
df_dutch = df_dutch %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_dutch %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## # A tibble: 9 × 3
##   num_sylls_est total correct_total
##           <dbl> <dbl>         <dbl>
## 1             1  4672          4672
## 2             2 20588         20588
## 3             3 26420         26420
## 4             4 11338         11338
## 5             5  3394          3394
## 6             6   846           846
## 7             7   189           189
## 8             8    28            28
## 9             9     2             2

We then merge with frequency data for each individual lemma. (We also calculate entropy over senses in the process.)

df_dutch_lemmas = read_delim("../data/frequency/dutch/lemmas.csv", delim = "\\")
## Parsed with column specification:
## cols(
##   PhonDISC = col_character(),
##   Inl = col_double()
## )
nrow(df_dutch_lemmas)
## [1] 123926
df_dutch_lemmas = df_dutch_lemmas %>%
  mutate(freq_adjusted = Inl + 1) %>%
  group_by(PhonDISC) %>%
  mutate(total_frequency = sum(freq_adjusted),
         relative_frequency = freq_adjusted / total_frequency)

df_entropy = df_dutch_lemmas %>%
  group_by(PhonDISC) %>%
  summarise(entropy = -sum(relative_frequency * log(relative_frequency)),
            total_frequency = mean(total_frequency))


df_merged_dutch = df_dutch %>%
  inner_join(df_entropy, by = "PhonDISC")

nrow(df_merged_dutch)
## [1] 63364
df_merged_dutch$frequency = log10(df_merged_dutch$total_frequency)

Now we can visualize the outcome in a variety of ways:

df_merged_dutch %>%
  plot_comparison()

df_merged_dutch %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_merged_dutch %>% 
  run_regression()

output %>%
  plot_outcome()

output$model
## # A tibble: 4 × 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -5.02     0.385     -13.0  9.43e- 39
## 2 frequency              -1.85     0.0688    -26.9  3.35e-158
## 3 num_sylls_est           0.402    0.0712      5.65 1.66e-  8
## 4 normalized_surprisal    4.04     0.165      24.4  3.48e-131
output$comparison
## # A tibble: 1 × 6
##   res.df       rss    df   sumsq statistic   p.value
##    <dbl>     <dbl> <dbl>   <dbl>     <dbl>     <dbl>
## 1  63360 16265983.     1 185420.      722. 3.35e-158

Directly contrast the relationships:

df_merged_dutch %>%
  plot_contrast(bins = 20)

Calculate Sense Entropy

nrow(filter(df_merged_dutch, num_homophones == 1))
## [1] 1630
model_full = lm(data = filter(df_merged_dutch, num_homophones == 1),
                delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)

summary(model_full)
## 
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal, 
##     data = filter(df_merged_dutch, num_homophones == 1))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -562.49   -2.13    2.17    5.85   20.02 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -4.1691     3.2415  -1.286    0.199    
## entropy               -2.8531     2.4407  -1.169    0.243    
## frequency             -4.4829     0.5858  -7.653 3.36e-14 ***
## num_sylls_est         -0.3627     0.6626  -0.547    0.584    
## normalized_surprisal   8.4880     1.0560   8.038 1.74e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.99 on 1625 degrees of freedom
## Multiple R-squared:  0.08019,    Adjusted R-squared:  0.07793 
## F-statistic: 35.42 on 4 and 1625 DF,  p-value: < 2.2e-16
model_reduced = lm(data = filter(df_merged_dutch, num_homophones ==1),
                delta ~ frequency + num_sylls_est + normalized_surprisal)

anova(model_reduced, model_full)
## Analysis of Variance Table
## 
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   1626 786128                           
## 2   1625 785468  1    660.52 1.3665 0.2426

German

df_german = read_data("../data/processed/german/reals/german_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   `Unnamed: 0` = col_double(),
##   index = col_double(),
##   Word = col_character(),
##   CompCnt = col_double(),
##   PhonDISC = col_character(),
##   SylCnt = col_double(),
##   num_phones = col_double(),
##   num_sylls_est = col_double(),
##   remove = col_logical(),
##   num_homophones = col_double(),
##   log_prob = col_double(),
##   surprisal = col_double(),
##   rank_num_homophones = col_double(),
##   neighborhood_size = col_double(),
##   heldout_log_prob = col_double(),
##   heldout_surprisal = col_double()
## )
nrow(df_german)
## [1] 50435
## Now normalize probabilities and compute expected homophones per wordform
df_german = df_german %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_german %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## # A tibble: 10 × 3
##    num_sylls_est total correct_total
##            <dbl> <dbl>         <dbl>
##  1             1  2454          2454
##  2             2 13181         13181
##  3             3 19429         19429
##  4             4 10983         10983
##  5             5  3971          3971
##  6             6  1240          1240
##  7             7   328           328
##  8             8    72            72
##  9             9    16            16
## 10            10     2             2

We then merge with frequency data for each individual lemma. (We also calculate entropy over senses in the process.)

df_german_lemmas = read_delim("../data/frequency/german/lemmas.csv", delim = "\\")
## Parsed with column specification:
## cols(
##   PhonDISC = col_character(),
##   Mann = col_double()
## )
nrow(df_german_lemmas)
## [1] 51728
df_german_lemmas = df_german_lemmas %>%
  mutate(freq_adjusted = Mann + 1) %>%
  group_by(PhonDISC) %>%
  mutate(total_frequency = sum(freq_adjusted),
         relative_frequency = freq_adjusted / total_frequency)

df_entropy = df_german_lemmas %>%
  group_by(PhonDISC) %>%
  summarise(entropy = -sum(relative_frequency * log(relative_frequency)),
            total_frequency = mean(total_frequency))

df_merged_german = df_german %>%
  inner_join(df_entropy, by = "PhonDISC")

nrow(df_merged_german)
## [1] 46720
df_merged_german$frequency = log10(df_merged_german$total_frequency)

Now we can visualize the outcome in a variety of ways:

df_merged_german %>%
  plot_comparison()

df_merged_german %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_merged_german %>% 
  run_regression()

output %>%
  plot_outcome()

output$model
## # A tibble: 4 × 5
##   term                 estimate std.error statistic  p.value
##   <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)            -4.59     0.428     -10.7  8.05e-27
## 2 frequency              -1.28     0.0984    -13.1  6.93e-39
## 3 num_sylls_est           0.362    0.0734      4.93 8.19e- 7
## 4 normalized_surprisal    3.42     0.182      18.7  4.84e-78
output$comparison
## # A tibble: 1 × 6
##   res.df       rss    df  sumsq statistic  p.value
##    <dbl>     <dbl> <dbl>  <dbl>     <dbl>    <dbl>
## 1  46716 11461734.     1 41818.      170. 6.93e-39

Directly contrast the relationships:

df_merged_german %>%
  plot_contrast(bins = 20)

Calculate Sense Entropy

nrow(filter(df_merged_german, num_homophones == 1))
## [1] 1042
model_full = lm(data = filter(df_merged_german, num_homophones == 1),
                delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)

summary(model_full)
## 
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal, 
##     data = filter(df_merged_german, num_homophones == 1))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -255.460   -1.288    2.997    6.323   16.926 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.5286     3.7211  -0.411   0.6813    
## entropy               -5.9785     2.7656  -2.162   0.0309 *  
## frequency             -4.5311     0.6498  -6.973 5.50e-12 ***
## num_sylls_est         -0.5602     0.7199  -0.778   0.4367    
## normalized_surprisal   7.9448     1.1008   7.217 1.03e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.05 on 1037 degrees of freedom
## Multiple R-squared:  0.1082, Adjusted R-squared:  0.1048 
## F-statistic: 31.47 on 4 and 1037 DF,  p-value: < 2.2e-16
model_reduced = lm(data = filter(df_merged_german, num_homophones ==1),
                delta ~ frequency + num_sylls_est + normalized_surprisal)

anova(model_reduced, model_full)
## Analysis of Variance Table
## 
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1   1038 339256                              
## 2   1037 337734  1      1522 4.6732 0.03086 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

French

df_french = read_data("../data/processed/french/reals/french_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `1_ortho` = col_character(),
##   `2_phon` = col_character(),
##   `3_lemme` = col_character(),
##   `4_cgram` = col_character(),
##   `5_genre` = col_character(),
##   `6_nombre` = col_character(),
##   `11_infover` = col_character(),
##   `17_cvcv` = col_character(),
##   `18_p_cvcv` = col_character(),
##   `23_syll` = col_character(),
##   `25_cv-cv` = col_character(),
##   `26_orthrenv` = col_character(),
##   `27_phonrenv` = col_character(),
##   `28_orthosyll` = col_character(),
##   `29_cgramortho` = col_character(),
##   `34_morphoder` = col_character(),
##   remove = col_logical()
## )
## See spec(...) for full column specifications.
nrow(df_french)
## [1] 37278
## Now normalize probabilities and compute expected homophones per wordform
df_french = df_french %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_french %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## # A tibble: 9 × 3
##   num_sylls_est total correct_total
##           <dbl> <dbl>         <dbl>
## 1             1  3893          3893
## 2             2 13794         13794
## 3             3 15258         15258
## 4             4  7589          7589
## 5             5  2466          2466
## 6             6   670           670
## 7             7    97            97
## 8             8    12            12
## 9             9     3             3

Read in French lemmas.

df_french_lemmas = read_csv("../data/processed/french/reals/french_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `1_ortho` = col_character(),
##   `2_phon` = col_character(),
##   `3_lemme` = col_character(),
##   `4_cgram` = col_character(),
##   `5_genre` = col_character(),
##   `6_nombre` = col_character(),
##   `11_infover` = col_character(),
##   `17_cvcv` = col_character(),
##   `18_p_cvcv` = col_character(),
##   `23_syll` = col_character(),
##   `25_cv-cv` = col_character(),
##   `26_orthrenv` = col_character(),
##   `27_phonrenv` = col_character(),
##   `28_orthosyll` = col_character(),
##   `29_cgramortho` = col_character(),
##   `34_morphoder` = col_character(),
##   remove = col_logical()
## )
## See spec(...) for full column specifications.
nrow(df_french_lemmas)
## [1] 43782
sum(df_french_lemmas$`8_freqlemlivres`)
## [1] 877113.8
df_french_lemmas = df_french_lemmas %>%
  ## Multiply by 14.8, because frequency measures were divided by 14.8 in Lexique
  mutate(freq_adjusted = `8_freqlemlivres` * 14.8) %>%
  group_by(`2_phon`) %>%
  summarise(total_frequency = sum(freq_adjusted))

nrow(df_french_lemmas)
## [1] 37278
sum(df_french_lemmas$total_frequency)
## [1] 12981285
df_french_merged = df_french %>%
  inner_join(df_french_lemmas, by = "2_phon")

nrow(df_french)
## [1] 37278
nrow(df_french_merged)
## [1] 37278
df_french_merged$frequency = log10(df_french_merged$total_frequency + 1)

And finally, we can run the regression and visualize model coefficients:

output = df_french_merged %>% 
  run_regression()

output$model
## # A tibble: 4 × 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -5.53     0.277     -20.0  2.41e- 88
## 2 frequency              -0.446    0.0524     -8.51 1.82e- 17
## 3 num_sylls_est           0.563    0.0481     11.7  1.55e- 31
## 4 normalized_surprisal    3.18     0.109      29.1  1.47e-184
output$comparison
## # A tibble: 1 × 6
##   res.df      rss    df sumsq statistic  p.value
##    <dbl>    <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1  37274 2687620.     1 5221.      72.4 1.82e-17

Directly contrast the relationships:

df_french_merged %>%
  plot_contrast(bins = 20)

Calculate Sense Entropy

df_french_lemmas = read_csv("../data/processed/french/reals/french_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `1_ortho` = col_character(),
##   `2_phon` = col_character(),
##   `3_lemme` = col_character(),
##   `4_cgram` = col_character(),
##   `5_genre` = col_character(),
##   `6_nombre` = col_character(),
##   `11_infover` = col_character(),
##   `17_cvcv` = col_character(),
##   `18_p_cvcv` = col_character(),
##   `23_syll` = col_character(),
##   `25_cv-cv` = col_character(),
##   `26_orthrenv` = col_character(),
##   `27_phonrenv` = col_character(),
##   `28_orthosyll` = col_character(),
##   `29_cgramortho` = col_character(),
##   `34_morphoder` = col_character(),
##   remove = col_logical()
## )
## See spec(...) for full column specifications.
df_french_lemmas$freq_adjusted = df_french_lemmas$`8_freqlemlivres` * 14.8 + 1
nrow(df_french_lemmas)
## [1] 43782
df_french_lemmas = df_french_lemmas %>%
  group_by(`2_phon`) %>%
  mutate(total_frequency = sum(freq_adjusted),
         relative_frequency = freq_adjusted / total_frequency)

df_entropy = df_french_lemmas %>%
  group_by(`2_phon`) %>%
  summarise(entropy = -sum(relative_frequency * log(relative_frequency)))


df_french_entropy = df_french_merged %>%
  inner_join(df_entropy, by = "2_phon")
nrow(df_french_entropy)
## [1] 37278
nrow(df_french_merged)
## [1] 37278
nrow(filter(df_french_entropy, num_homophones == 1))
## [1] 4540
model_full = lm(data = filter(df_french_entropy, num_homophones == 1),
                delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)

summary(model_full)
## 
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal, 
##     data = filter(df_french_entropy, num_homophones == 1))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -525.90   -0.21    1.14    2.08    5.54 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5.74449    1.24840  -4.601 4.31e-06 ***
## entropy               0.09027    0.98403   0.092 0.926910    
## frequency            -0.81971    0.22840  -3.589 0.000336 ***
## num_sylls_est         0.83454    0.21022   3.970 7.30e-05 ***
## normalized_surprisal  3.69626    0.39147   9.442  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.13 on 4535 degrees of freedom
## Multiple R-squared:  0.02544,    Adjusted R-squared:  0.02458 
## F-statistic:  29.6 on 4 and 4535 DF,  p-value: < 2.2e-16
model_reduced = lm(data = filter(df_french_entropy, num_homophones ==1),
                delta ~ frequency + num_sylls_est + normalized_surprisal)

anova(model_reduced, model_full)
## Analysis of Variance Table
## 
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   4536 562031                           
## 2   4535 562030  1     1.043 0.0084 0.9269

Japanese

df_japanese = read_data("../data/processed/japanese/reals/japanese_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   orth_form_kanji = col_character(),
##   orth_form_hiragana = col_character(),
##   orth_form_romaji = col_character(),
##   phonetic_form = col_character(),
##   morph_form = col_character(),
##   glosses = col_character(),
##   multiple_pronunications = col_logical(),
##   phonetic_remapped = col_character(),
##   remove = col_logical()
## )
## See spec(...) for full column specifications.
nrow(df_japanese)
## [1] 40449
## Now normalize probabilities and compute expected homophones per wordform
df_japanese = df_japanese %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that the totals match the number of meanings (M) in the lexicon
df_japanese %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## # A tibble: 17 × 3
##    num_sylls_est total correct_total
##            <dbl> <dbl>         <dbl>
##  1             1  1011          1011
##  2             2  7710          7710
##  3             3 14816         14816
##  4             4 14237         14237
##  5             5  6619          6619
##  6             6  3673          3673
##  7             7  1657          1657
##  8             8   733           733
##  9             9   345           345
## 10            10   185           185
## 11            11    72            72
## 12            12    52            52
## 13            13    19            19
## 14            14     8             8
## 15            15     7             7
## 16            16     2             2
## 17            22     1             1
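
This check works because each row corresponds to a unique wordform; counting its \(k\) homophones plus the wordform itself and summing within a syllable-length class should therefore recover \(M\), the number of meanings assigned to that class:

\[
\sum_{i \in \text{class}} (k_i + 1) = M_{\text{class}}
\]

The table above confirms this for every syllable length.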

The Japanese data already include frequency information; we aggregate it per unique phonetic form before merging.

df_japanese_lemmas = read_csv("../data/processed/japanese/reals/japanese_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   `Unnamed: 0` = col_double(),
##   orth_form_kanji = col_character(),
##   orth_form_hiragana = col_character(),
##   orth_form_romaji = col_character(),
##   phonetic_form = col_character(),
##   morph_form = col_character(),
##   frequency = col_double(),
##   glosses = col_character(),
##   multiple_pronunications = col_logical(),
##   phonetic_remapped = col_character(),
##   num_phones = col_double(),
##   num_sylls_est = col_double(),
##   remove = col_logical(),
##   num_homophones = col_double(),
##   log_prob = col_double(),
##   surprisal = col_double()
## )
nrow(df_japanese_lemmas)
## [1] 51147
df_japanese_lemmas = df_japanese_lemmas %>%
  group_by(phonetic_remapped) %>%
  summarise(total_frequency = mean(frequency))

nrow(df_japanese_lemmas)
## [1] 40449
df_japanese_merged = df_japanese %>%
  inner_join(df_japanese_lemmas, by = "phonetic_remapped")

nrow(df_japanese)
## [1] 40449
nrow(df_japanese_merged)
## [1] 40449
df_japanese_merged$frequency = log10(df_japanese_merged$total_frequency + 1)

And finally, we can run the regression and visualize model coefficients:

output = df_japanese_merged %>% 
  run_regression()

output %>%
  plot_outcome()

output$model
## # A tibble: 4 × 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -3.83     0.144     -26.5  7.36e-154
## 2 frequency              -1.73     0.106     -16.4  5.95e- 60
## 3 num_sylls_est           0.129    0.0183      7.05 1.85e- 12
## 4 normalized_surprisal    2.92     0.0851     34.3  1.91e-254
output$comparison
## # A tibble: 1 × 6
##   res.df      rss    df sumsq statistic  p.value
##    <dbl>    <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1  40445 1206484.     1 7982.      268. 5.95e-60

Directly contrast the relationships:

df_japanese_merged %>%
  plot_contrast(bins = 20)

Mandarin: CallHome

df_mandarin = read_data("../data/processed/mandarin/reals/mandarin_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Key = col_character(),
##   `PY+T` = col_character(),
##   `Pho+T` = col_character(),
##   `IPA+T` = col_character(),
##   `PY-T` = col_character(),
##   `Pho-T` = col_character(),
##   `IPA-T` = col_character(),
##   Tone = col_character(),
##   SyStruct = col_character(),
##   Words = col_character(),
##   `Dominate-POS` = col_character(),
##   `Other-POSes` = col_character(),
##   Neighbors = col_character(),
##   file = col_character(),
##   homophonous_words = col_character(),
##   word = col_character(),
##   glides_remapped = col_character(),
##   phonetic_remapped = col_character(),
##   remove = col_logical()
## )
## See spec(...) for full column specifications.
nrow(df_mandarin)
## [1] 45871
## Now normalize probabilities and compute expected homophones per wordform
df_mandarin = df_mandarin %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## Joining, by = "num_sylls_est"
nrow(df_mandarin)
## [1] 45871
## Double-check that the totals match the number of meanings (M) in the lexicon
df_mandarin %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## # A tibble: 12 × 3
##    num_sylls_est total correct_total
##            <dbl> <dbl>         <dbl>
##  1             1  4658          4658
##  2             2 32372         32372
##  3             3  9186          9186
##  4             4  5336          5336
##  5             5   312           312
##  6             6    97            97
##  7             7    56            56
##  8             8    17            17
##  9             9     3             3
## 10            10     6             6
## 11            11     3             3
## 12            14     1             1

Mandarin already has frequency data, but we need to make sure it's summed across lemmas.

df_mandarin_lemmas = read_csv("../data/processed/mandarin/reals/mandarin_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Key = col_character(),
##   `PY+T` = col_character(),
##   `Pho+T` = col_character(),
##   `IPA+T` = col_character(),
##   `PY-T` = col_character(),
##   `Pho-T` = col_character(),
##   `IPA-T` = col_character(),
##   Tone = col_character(),
##   SyStruct = col_character(),
##   Words = col_character(),
##   `Dominate-POS` = col_character(),
##   `Other-POSes` = col_character(),
##   Neighbors = col_character(),
##   file = col_character(),
##   homophonous_words = col_character(),
##   word = col_character(),
##   glides_remapped = col_character(),
##   phonetic_remapped = col_character(),
##   remove = col_logical()
## )
## See spec(...) for full column specifications.
df_mandarin_lemmas = df_mandarin_lemmas %>%
  mutate(frequency = FreqPM * 1000000) %>%
  group_by(phonetic_remapped) %>%
  summarise(total_frequency = sum(frequency))

nrow(df_mandarin_lemmas)
## [1] 45871
df_mandarin_merged = df_mandarin %>%
  inner_join(df_mandarin_lemmas, by = "phonetic_remapped")

nrow(df_mandarin)
## [1] 45871
nrow(df_mandarin_merged)
## [1] 45871
df_mandarin_merged$frequency = log10(df_mandarin_merged$total_frequency + 1)

Now we can visualize the outcome in a variety of ways:

df_mandarin_merged %>%
  plot_comparison()

df_mandarin_merged %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_mandarin_merged %>% 
  run_regression()

output %>%
  plot_outcome()

output$model
## # A tibble: 4 × 5
##   term                 estimate std.error statistic  p.value
##   <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)           -3.77      0.325     -11.6  4.16e-31
## 2 frequency             -0.284     0.0548     -5.19 2.10e- 7
## 3 num_sylls_est          0.0996    0.0625      1.59 1.11e- 1
## 4 normalized_surprisal   4.37      0.256      17.1  2.67e-65
output$comparison
## # A tibble: 1 × 6
##   res.df      rss    df sumsq statistic     p.value
##    <dbl>    <dbl> <dbl> <dbl>     <dbl>       <dbl>
## 1  45867 4453802.     1 2617.      26.9 0.000000210

Directly contrast the relationships:

df_mandarin_merged %>%
  plot_contrast(bins = 20)

Mandarin: Chinese Lexical Database

Here, we calculate the Homophony Delta for Mandarin Chinese using the Chinese Lexical Database (CLD).

df_mandarin_cld = read_data("../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Word = col_character(),
##   C1 = col_character(),
##   C2 = col_character(),
##   C3 = col_logical(),
##   C4 = col_logical(),
##   C1Structure = col_character(),
##   C2Structure = col_character(),
##   C3Structure = col_logical(),
##   C4Structure = col_logical(),
##   C1Type = col_character(),
##   C2Type = col_character(),
##   C3Type = col_logical(),
##   C4Type = col_logical(),
##   Pinyin = col_character(),
##   C1Pinyin = col_character(),
##   C2Pinyin = col_character(),
##   C3Pinyin = col_logical(),
##   C4Pinyin = col_logical(),
##   IPA = col_character(),
##   C1IPA = col_character()
##   # ... with 125 more columns
## )
## See spec(...) for full column specifications.
## Warning: 712019 parsing failures.
##   row         col           expected actual                                                                            file
## 30786 C3          1/0/T/F/TRUE/FALSE  么    '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3Structure 1/0/T/F/TRUE/FALSE  SG    '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3Type      1/0/T/F/TRUE/FALSE  Other '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3Pinyin    1/0/T/F/TRUE/FALSE  me5   '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3IPA       1/0/T/F/TRUE/FALSE  mɤ5   '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## ..... ........... .................. ...... ...............................................................................
## See problems(...) for more details.
nrow(df_mandarin_cld)
## [1] 41009
## Now normalize probabilities and compute expected homophones per wordform
df_mandarin_cld = df_mandarin_cld %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## Joining, by = "num_sylls_est"
nrow(df_mandarin_cld)
## [1] 41009
## Double-check that the totals match the number of meanings (M) in the lexicon
df_mandarin_cld %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## # A tibble: 4 × 3
##   num_sylls_est total correct_total
##           <dbl> <dbl>         <dbl>
## 1             1  3648          3648
## 2             2 31637         31637
## 3             3  6930          6930
## 4             4  3337          3337

Mandarin already has frequency data, but we need to make sure it's summed across lemmas.

df_mandarin_lemmas = read_csv("../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Word = col_character(),
##   C1 = col_character(),
##   C2 = col_character(),
##   C3 = col_logical(),
##   C4 = col_logical(),
##   C1Structure = col_character(),
##   C2Structure = col_character(),
##   C3Structure = col_logical(),
##   C4Structure = col_logical(),
##   C1Type = col_character(),
##   C2Type = col_character(),
##   C3Type = col_logical(),
##   C4Type = col_logical(),
##   Pinyin = col_character(),
##   C1Pinyin = col_character(),
##   C2Pinyin = col_character(),
##   C3Pinyin = col_logical(),
##   C4Pinyin = col_logical(),
##   IPA = col_character(),
##   C1IPA = col_character()
##   # ... with 125 more columns
## )
## See spec(...) for full column specifications.
## Warning: 715162 parsing failures.
##   row         col           expected actual                                                                     file
## 35286 C3          1/0/T/F/TRUE/FALSE  么    '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Structure 1/0/T/F/TRUE/FALSE  SG    '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Type      1/0/T/F/TRUE/FALSE  Other '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Pinyin    1/0/T/F/TRUE/FALSE  me5   '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3IPA       1/0/T/F/TRUE/FALSE  mɤ5   '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## ..... ........... .................. ...... ........................................................................
## See problems(...) for more details.
nrow(df_mandarin_lemmas)
## [1] 45552
df_mandarin_lemmas = df_mandarin_lemmas %>%
  group_by(phonetic_remapped) %>%
  summarise(total_frequency = sum(FrequencyRaw))

nrow(df_mandarin_lemmas)
## [1] 41009
df_mandarin_merged = df_mandarin_cld %>%
  inner_join(df_mandarin_lemmas, by = "phonetic_remapped")

nrow(df_mandarin_cld)
## [1] 41009
nrow(df_mandarin_merged)
## [1] 41009
df_mandarin_merged$frequency = log10(df_mandarin_merged$total_frequency)

Now we can visualize the outcome in a variety of ways:

df_mandarin_merged %>%
  plot_comparison()

df_mandarin_merged %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_mandarin_merged %>% 
  run_regression()

output %>%
  plot_outcome()

output$model
## # A tibble: 4 × 5
##   term                 estimate std.error statistic  p.value
##   <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)          -1.85       0.0987   -18.7   5.10e-78
## 2 frequency            -0.275      0.0156   -17.6   8.29e-69
## 3 num_sylls_est         0.00764    0.0221     0.345 7.30e- 1
## 4 normalized_surprisal  2.71       0.0683    39.7   0
output$comparison
## # A tibble: 1 × 6
##   res.df     rss    df sumsq statistic  p.value
##    <dbl>   <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1  41005 303218.     1 2281.      309. 8.29e-69

Directly contrast the relationships:

df_mandarin_merged %>%
  plot_contrast(bins = 20)

Calculate Sense Entropy
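
For each phonetic form, sense entropy is the Shannon entropy of the relative frequencies of the lemmas sharing that form. With add-one smoothing on the raw frequencies (as in the code below), and writing \(f_l\) for the raw frequency of lemma \(l\):

\[
p(l \mid w) = \frac{f_l + 1}{\sum_{l'} (f_{l'} + 1)}, \qquad
H(w) = -\sum_{l} p(l \mid w) \, \log p(l \mid w)
\]

Since the natural logarithm is used, entropy is measured in nats.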

df_mandarin_lemmas = read_csv("../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Word = col_character(),
##   C1 = col_character(),
##   C2 = col_character(),
##   C3 = col_logical(),
##   C4 = col_logical(),
##   C1Structure = col_character(),
##   C2Structure = col_character(),
##   C3Structure = col_logical(),
##   C4Structure = col_logical(),
##   C1Type = col_character(),
##   C2Type = col_character(),
##   C3Type = col_logical(),
##   C4Type = col_logical(),
##   Pinyin = col_character(),
##   C1Pinyin = col_character(),
##   C2Pinyin = col_character(),
##   C3Pinyin = col_logical(),
##   C4Pinyin = col_logical(),
##   IPA = col_character(),
##   C1IPA = col_character()
##   # ... with 125 more columns
## )
## See spec(...) for full column specifications.
## Warning: 715162 parsing failures.
##   row         col           expected actual                                                                     file
## 35286 C3          1/0/T/F/TRUE/FALSE  么    '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Structure 1/0/T/F/TRUE/FALSE  SG    '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Type      1/0/T/F/TRUE/FALSE  Other '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Pinyin    1/0/T/F/TRUE/FALSE  me5   '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3IPA       1/0/T/F/TRUE/FALSE  mɤ5   '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## ..... ........... .................. ...... ........................................................................
## See problems(...) for more details.
nrow(df_mandarin_lemmas)
## [1] 45552
df_mandarin_lemmas = df_mandarin_lemmas %>%
  mutate(freq_adjusted = FrequencyRaw + 1) %>%
  group_by(phonetic_remapped) %>%
  mutate(total_frequency = sum(freq_adjusted),
         relative_frequency = freq_adjusted / total_frequency)

df_entropy = df_mandarin_lemmas %>%
  group_by(phonetic_remapped) %>%
  summarise(entropy = -sum(relative_frequency * log(relative_frequency)))


df_mandarin_entropy = df_mandarin_merged %>%
  inner_join(df_entropy, by = "phonetic_remapped")
nrow(df_mandarin_entropy)
## [1] 41009
nrow(filter(df_mandarin_entropy, num_homophones == 1))
## [1] 1871
model_full = lm(data = filter(df_mandarin_entropy, num_homophones == 1),
                delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)

summary(model_full)
## 
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal, 
##     data = filter(df_mandarin_entropy, num_homophones == 1))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.710  -0.380   0.886   1.827   4.344 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.7619     0.9007  -3.067  0.00220 ** 
## entropy               -1.3703     0.4268  -3.211  0.00135 ** 
## frequency             -0.7130     0.1257  -5.670 1.65e-08 ***
## num_sylls_est          0.5328     0.2405   2.216  0.02683 *  
## normalized_surprisal   4.4816     0.4227  10.601  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.655 on 1866 degrees of freedom
## Multiple R-squared:  0.07813,    Adjusted R-squared:  0.07615 
## F-statistic: 39.54 on 4 and 1866 DF,  p-value: < 2.2e-16
model_reduced = lm(data = filter(df_mandarin_entropy, num_homophones == 1),
                delta ~ frequency + num_sylls_est + normalized_surprisal)

anova(model_reduced, model_full)
## Analysis of Variance Table
## 
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
##   Res.Df   RSS Df Sum of Sq      F   Pr(>F)   
## 1   1867 25067                                
## 2   1866 24929  1    137.71 10.308 0.001347 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Visualizations

Combine lexica

Below, we combine the lexica so that we can visualize the central relationships in the same plot.

# Helper to bin frequency / neighborhood size and tag each lexicon
prepare_lexicon = function(df, lang_name) {
  df %>%
    mutate(binned_frequency = ntile(frequency, n = 20),
           binned_neighorhood_size = ntile(neighborhood_size, n = 20),
           language = lang_name) %>%
    select(num_sylls_est, normalized_surprisal, binned_neighorhood_size,
           num_homophones, k, frequency, binned_frequency, delta, language)
}

df_merged_english_f = prepare_lexicon(df_merged_english, 'English')
df_merged_dutch_f = prepare_lexicon(df_merged_dutch, 'Dutch')
df_merged_german_f = prepare_lexicon(df_merged_german, 'German')
df_french_f = prepare_lexicon(df_french_merged, 'French')
df_japanese_f = prepare_lexicon(df_japanese_merged, 'Japanese')
df_mandarin_cld_f = prepare_lexicon(df_mandarin_merged, 'Mandarin')

df_all_lexica = bind_rows(df_merged_english_f, df_merged_dutch_f,
                          df_merged_german_f, df_french_f,
                          df_japanese_f, df_mandarin_cld_f)

Visualize relationships

PlotTheme = theme(
  axis.title.x = element_text(size = 16),
  axis.text.x = element_text(size = 14),
  axis.text.y = element_text(size = 14),
  axis.title.y = element_text(size = 16),
  strip.text.x = element_text(size = 16),
  title = element_text(size = 16),
  legend.text = element_text(size = 16),
  legend.title = element_text(size = 16))

df_all_lexica %>%
  group_by(binned_frequency, language) %>%
  summarise(mean_delta = mean(delta),
            se_delta = sd(delta) / sqrt(n())) %>%
  ggplot(aes(x = binned_frequency,
             y = mean_delta)) +
  geom_point(stat = "identity", size = .2) +
  geom_hline(yintercept = 0, linetype = "dotted") +
  geom_errorbar(aes(ymin = mean_delta - 2 * se_delta, 
                    ymax = mean_delta + 2 *se_delta), 
                width=.2,
                position=position_dodge(.9)) +
  labs(x = "Binned Frequency",
       y = "Delta (Real - Expected)") +
  geom_smooth() +
  theme_bw() +
  facet_wrap(~language) +
  PlotTheme
## `summarise()` has grouped output by 'binned_frequency'. You can override using the `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggsave("../Figures/delta.png", dpi = 300)
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_all_lexica %>%
  mutate(Baseline = k,
         Real = num_homophones) %>%
  pivot_longer(c(Baseline, Real), 
               names_to = "Lexicon", 
               values_to = "homophones") %>%
  group_by(binned_frequency, language, Lexicon) %>%
  summarise(mean_homophones = mean(homophones),
            se_homophones = sd(homophones) / sqrt(n())) %>%
  ggplot(aes(x = binned_frequency,
             y = mean_homophones,
             color = Lexicon)) +
  geom_point(stat = "identity", size = .5, alpha = .5) +
  geom_hline(yintercept = 0, linetype = "dotted") +
  geom_errorbar(aes(ymin = mean_homophones - 2 * se_homophones, 
                    ymax = mean_homophones + 2 * se_homophones), 
                width=.2) +
  labs(x = "Binned Frequency",
       y = "Number of Homophones") +
  geom_smooth(alpha = .6) +
  theme_bw() +
  facet_wrap(~language) +
  PlotTheme +
  theme(panel.spacing.x = unit(1.5, "lines"))
## `summarise()` has grouped output by 'binned_frequency', 'language'. You can override using the `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggsave("../Figures/fig1_frequency.png", dpi = 300)
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_all_lexica %>%
  mutate(Baseline = k,
         Real = num_homophones) %>%
  pivot_longer(c(Baseline, Real), 
               names_to = "Lexicon", 
               values_to = "homophones") %>%
  group_by(binned_neighorhood_size, language, Lexicon) %>%
  summarise(mean_homophones = mean(homophones),
            se_homophones = sd(homophones) / sqrt(n())) %>%
  ggplot(aes(x = binned_neighorhood_size,
             y = mean_homophones,
             color = Lexicon)) +
  geom_point(stat = "identity", size = .5, alpha = .5) +
  geom_hline(yintercept = 0, linetype = "dotted") +
  geom_errorbar(aes(ymin = mean_homophones - 2 * se_homophones, 
                    ymax = mean_homophones + 2 * se_homophones), 
                width=.2) +
  labs(x = "Binned Neighborhood Size",
       y = "Number of Homophones") +
  geom_smooth(alpha = .6) +
  theme_bw() +
  facet_wrap(~language) +
  PlotTheme +
  theme(panel.spacing.x = unit(1.5, "lines"))
## `summarise()` has grouped output by 'binned_neighorhood_size', 'language'. You can override using the `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggsave("../Figures/fig3_neighbors.png", dpi = 300)
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Visualize coefficients

return_regression_table = function(df) {
  # Build full model
  model_full = lm(data = df,
                  delta ~ frequency + num_sylls_est + normalized_surprisal)
  # Tidy model output
  df_model = broom::tidy(model_full)
  df_model
}

### Get coefficients for each language and combine into a single dataframe
df_all_coefficients = bind_rows(
  return_regression_table(df_merged_english_f) %>% mutate(language = 'English'),
  return_regression_table(df_merged_dutch_f) %>% mutate(language = 'Dutch'),
  return_regression_table(df_merged_german_f) %>% mutate(language = 'German'),
  return_regression_table(df_french_f) %>% mutate(language = 'French'),
  return_regression_table(df_japanese_f) %>% mutate(language = 'Japanese'),
  return_regression_table(df_mandarin_cld_f) %>% mutate(language = 'Mandarin')
)


df_all_coefficients$predictor = fct_recode(
  df_all_coefficients$term,
  "Number of Syllables" = "num_sylls_est",
  "Normalized Surprisal" = "normalized_surprisal",
  "Log(Frequency)" = "frequency"
)

### Plots outcome of regression
df_all_coefficients %>%
  filter(predictor != "(Intercept)") %>%
  ggplot(aes(x = language,
             y = estimate)) +
  geom_point() +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dotted") +
  geom_errorbar(aes(ymin = estimate - 2*std.error, 
                    ymax = estimate + 2*std.error), 
                width=.2,
                position=position_dodge(.9)) +
  labs(x = "Predictor",
       y = "Estimate") +
  theme_bw() +
  facet_wrap(~predictor) +
  PlotTheme

### Save to figure
ggsave("../Figures/fig2_coefficients.png", dpi = 300)
## Saving 7 x 5 in image

Extension: Neighborhoods

Across all languages tested, it appears that Frequency exhibits a negative relationship with Homophony Delta: more frequent wordforms are less homophonous than one would expect purely on the basis of their phonotactics.

But previous research (Trott & Bergen, 2020) supports further predictions about which areas of the lexicon should see the greatest increase or reduction in homophony. One example is Neighborhood Size: wordforms with more neighbors might show a corresponding decrease in homophony, for either of two reasons. These explanations are not mutually exclusive, though they posit distinct causal pathways for the reduction in homophony:

  1. If larger neighborhoods already increase the possibility of perceptual confusion (e.g., "bat" vs. "pat"), then a wordform \(w_i\) might have fewer homophones than its phonotactic probability \(p_i\) would predict when \(w_i\) already has many neighbors.

  2. If languages select against over-saturation (Trott & Bergen, 2020), and if this selection process ultimately results in larger neighborhood sizes, one would expect that the areas of the lexicon that have undergone the most negative selection relative to their phonotactics should have correspondingly larger neighborhoods.
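
To make the notion of a phonological neighbor concrete, here is a minimal, hypothetical sketch: neighbors are standardly defined as wordforms at edit distance 1, which can be counted with base R's adist. The analyses below rely on a precomputed neighborhood_size column, so this toy example is purely illustrative.

```r
# Toy sketch (not the pipeline's actual computation): neighborhood size
# as the number of other wordforms at Levenshtein (edit) distance 1.
wordforms = c("bat", "pat", "bit", "bats")
dists = adist(wordforms)                 # pairwise edit distances (base R)
neighborhood_size = rowSums(dists == 1)  # neighbors differ by one edit
setNames(neighborhood_size, wordforms)
# "bat" has 3 neighbors here (pat, bit, bats); the rest have 1 (bat)
```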

Helper functions

run_regression_with_neighborhood_size = function(df) {
  
  # initialize output
  out = list()
  
  # Log neighborhood
  df$log_neighborhood = log10(df$neighborhood_size + 1)
  
  # Build full model
  model_full = lm(data = df,
                  delta ~ frequency + num_sylls_est + normalized_surprisal + log_neighborhood)

  # Build reduced model
  model_reduced = lm(data = df,
                     delta ~ num_sylls_est + normalized_surprisal + frequency)
  
  # Run model comparison
  comparison = anova(model_reduced, model_full)
  df_comp = broom::tidy(comparison) %>%
    na.omit()
  
  # Tidy model output
  df_model = broom::tidy(model_full)
  
  # Store tidied outputs in the result list
  out$model = df_model
  out$comparison = df_comp
  
  out
}

plot_neighborhood_bins_residualized = function(df, n) {
  ### Plot residuals of delta ~ #sylls + surprisal + frequency against neighborhood size
  
  # Build reduced model
  model_reduced = lm(data = df,
                     delta ~ num_sylls_est + normalized_surprisal + frequency)
  df$resid = residuals(model_reduced)
  
  # Bin neighborhood size for plotting
  df$neighborhood_binned = as.numeric(ntile(df$neighborhood_size, n = n))
  
  g = df %>%
    group_by(neighborhood_binned) %>%
    summarise(mean_delta = mean(resid),
              se_delta = sd(resid) / sqrt(n())) %>%
    ggplot(aes(x = neighborhood_binned,
               y = mean_delta)) +
    geom_point(stat = "identity", size = .2) +
    geom_hline(yintercept = 0, linetype = "dotted") +
    geom_errorbar(aes(ymin = mean_delta - se_delta, 
                      ymax = mean_delta + se_delta), 
                  width=.2,
                  position=position_dodge(.9)) +
    labs(x = "Binned Neighborhood Size",
         y = "Residuals (Reduced Model)") +
    geom_smooth() +
    theme_minimal()

  g
}

plot_outcome = function(df_output) {
  
  df_output$model$predictor = fct_recode(
    df_output$model$term,
    "Number of Syllables" = "num_sylls_est",
    "Normalized Surprisal" = "normalized_surprisal",
    "Log(Frequency)" = "frequency",
    "Log(Neighborhood Size)" = "log_neighborhood"
  )
  
  ### Plots outcome of regression
  g = df_output$model %>%
    ggplot(aes(x = predictor,
               y = estimate)) +
    geom_point() +
    coord_flip() +
    geom_hline(yintercept = 0, linetype = "dotted") +
    geom_errorbar(aes(ymin = estimate - 2*std.error, 
                      ymax = estimate + 2*std.error), 
                  width=.2,
                  position=position_dodge(.9)) +
    labs(x = "Predictor",
         y = "Estimate") +
    theme_minimal()
  
  g
}

English

df_merged_english %>%
  plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood size")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df_merged_english %>%
  plot_neighborhood_bins_residualized(n = 20)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

output = df_merged_english %>% 
  run_regression_with_neighborhood_size()

output %>%
  plot_outcome()

output$model
## # A tibble: 5 × 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -5.16     0.394     -13.1  5.27e- 39
## 2 frequency              -0.363    0.0666     -5.45 5.08e-  8
## 3 num_sylls_est           0.326    0.0861      3.79 1.52e-  4
## 4 normalized_surprisal    3.55     0.131      27.1  2.47e-159
## 5 log_neighborhood       -1.59     0.182      -8.76 2.08e- 18
output$comparison
## # A tibble: 1 × 6
##   res.df      rss    df sumsq statistic  p.value
##    <dbl>    <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1  28909 2735378.     1 7258.      76.7 2.08e-18

Dutch

df_merged_dutch %>%
  plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood size")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df_merged_dutch %>%
  plot_neighborhood_bins_residualized(n = 20)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

output = df_merged_dutch %>% 
  run_regression_with_neighborhood_size()

output %>%
  plot_outcome()

output$model
## # A tibble: 5 × 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -4.52     0.399     -11.3  8.77e- 30
## 2 frequency              -1.74     0.0727    -23.9  2.41e-125
## 3 num_sylls_est           0.264    0.0769      3.43 6.07e-  4
## 4 normalized_surprisal    4.07     0.165      24.6  6.31e-133
## 5 log_neighborhood       -1.10     0.229      -4.79 1.68e-  6
output$comparison
## # A tibble: 1 × 6
##   res.df       rss    df sumsq statistic    p.value
##    <dbl>     <dbl> <dbl> <dbl>     <dbl>      <dbl>
## 1  63359 16260096.     1 5887.      22.9 0.00000168

German

df_merged_german %>%
  plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df_merged_german %>%
  plot_neighborhood_bins_residualized(n = 20)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

output = df_merged_german %>% 
  run_regression_with_neighborhood_size()

output %>%
  plot_outcome()

output$model
## # A tibble: 5 × 5
##   term                 estimate std.error statistic  p.value
##   <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)            -3.57     0.456      -7.83 4.92e-15
## 2 frequency              -1.17     0.100     -11.7  2.18e-31
## 3 num_sylls_est           0.145    0.0809      1.79 7.31e- 2
## 4 normalized_surprisal    3.35     0.183      18.4  5.40e-75
## 5 log_neighborhood       -1.85     0.292      -6.34 2.32e-10
output$comparison
## # A tibble: 1 × 6
##   res.df       rss    df sumsq statistic  p.value
##    <dbl>     <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1  46715 11451881.     1 9852.      40.2 2.32e-10

French

df_french_merged %>%
  plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df_french_merged %>%
  plot_neighborhood_bins_residualized(n = 20)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

output = df_french_merged %>% 
  run_regression_with_neighborhood_size()

output %>%
  plot_outcome()

output$model
## # A tibble: 5 × 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -2.49     0.301      -8.25 1.59e- 16
## 2 frequency              -0.153    0.0533     -2.87 4.09e-  3
## 3 num_sylls_est          -0.132    0.0556     -2.37 1.76e-  2
## 4 normalized_surprisal    3.07     0.108      28.3  1.07e-174
## 5 log_neighborhood       -3.49     0.143     -24.4  1.27e-130
output$comparison
## # A tibble: 1 × 6
##   res.df      rss    df  sumsq statistic   p.value
##    <dbl>    <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1  37273 2645315.     1 42305.      596. 1.27e-130

Japanese

df_japanese_merged %>%
  plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df_japanese_merged %>%
  plot_neighborhood_bins_residualized(n = 20)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

output = df_japanese_merged %>% 
  run_regression_with_neighborhood_size()

output %>%
  plot_outcome()

output$model
## # A tibble: 5 × 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -0.236    0.177      -1.33 1.84e-  1
## 2 frequency              -1.34     0.105     -12.8  3.49e- 37
## 3 num_sylls_est          -0.398    0.0238    -16.7  1.18e- 62
## 4 normalized_surprisal    2.68     0.0842     31.8  1.92e-219
## 5 log_neighborhood       -2.50     0.0737    -34.0  3.37e-249
output$comparison
## # A tibble: 1 × 6
##   res.df      rss    df  sumsq statistic   p.value
##    <dbl>    <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1  40444 1173044.     1 33439.     1153. 3.37e-249

Mandarin: Chinese Lexical Database

df_mandarin_merged %>%
  plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df_mandarin_merged %>%
  plot_neighborhood_bins_residualized(n = 20)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

output = df_mandarin_merged %>% 
  run_regression_with_neighborhood_size()

output %>%
  plot_outcome()

output$model
## # A tibble: 5 × 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)             0.284    0.109       2.61 9.18e-  3
## 2 frequency              -0.178    0.0155    -11.5  1.88e- 30
## 3 num_sylls_est          -0.595    0.0259    -23.0  3.83e-116
## 4 normalized_surprisal    2.42     0.0673     35.9  1.17e-278
## 5 log_neighborhood       -2.10     0.0496    -42.4  0
output$comparison
## # A tibble: 1 × 6
##   res.df     rss    df  sumsq statistic p.value
##    <dbl>   <dbl> <dbl>  <dbl>     <dbl>   <dbl>
## 1  41004 290500.     1 12718.     1795.       0

Supplementary 2: Dealing with collinearity

The three predictors (frequency, #syllables, and surprisal) are all correlated. The correlation between the latter two is addressed by using Normalized Surprisal instead of raw Surprisal. But this leaves the correlation between frequency and #syllables, and between frequency and surprisal.

Collinearity between predictors can sometimes lead to unstable coefficient estimates (even while not affecting overall model fit). There are several ways researchers try to account for potential collinearity. As a first step, one can calculate the variance inflation factor (VIF) for each predictor; a sufficiently high VIF (common rules of thumb flag values above 5 or 10) suggests that collinearity might be a problem.

Below, we calculate the VIF for the main model in each language. The VIF scores from the complete model are all low across languages, with a maximum of 1.48 (in English). This is reassuring.
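As a reminder of what `car::vif` computes, the VIF for a predictor is \(1 / (1 - R^2)\), where \(R^2\) comes from regressing that predictor on the remaining predictors. A minimal sketch on toy data (not the lexicon data used above):

```r
# Toy illustration of the VIF formula: VIF_j = 1 / (1 - R_j^2),
# where R_j^2 is from regressing predictor j on the other predictors.
set.seed(1)
x1 <- rnorm(1000)
x2 <- 0.5 * x1 + rnorm(1000)  # x2 is moderately correlated with x1
x3 <- rnorm(1000)             # x3 is independent of x1

r2 <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2)
vif_x1  # modest inflation, well below conventional cutoffs
```

With this construction the VIF for `x1` sits near 1.25, in the same low range as the values reported for the lexicon models below.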

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
m = lm(data = df_merged_english,
       delta ~ frequency + num_sylls_est + normalized_surprisal)
vif(m)
##            frequency        num_sylls_est normalized_surprisal 
##             1.127951             1.482446             1.337387
max(vif(m))
## [1] 1.482446
m = lm(data = df_merged_german,
       delta ~ frequency + num_sylls_est + normalized_surprisal)
vif(m)
##            frequency        num_sylls_est normalized_surprisal 
##             1.097079             1.309988             1.240748
max(vif(m))
## [1] 1.309988
m = lm(data = df_merged_dutch,
       delta ~ frequency + num_sylls_est + normalized_surprisal)
vif(m)
##            frequency        num_sylls_est normalized_surprisal 
##             1.057189             1.316050             1.253957
max(vif(m))
## [1] 1.31605
m = lm(data = df_japanese_merged,
       delta ~ frequency + num_sylls_est + normalized_surprisal)
vif(m)
##            frequency        num_sylls_est normalized_surprisal 
##             1.039111             1.125562             1.098075
max(vif(m))
## [1] 1.125562
m = lm(data = df_french_merged,
       delta ~ frequency + num_sylls_est + normalized_surprisal)
vif(m)
##            frequency        num_sylls_est normalized_surprisal 
##             1.112989             1.478553             1.348290
max(vif(m))
## [1] 1.478553
m = lm(data = df_mandarin_merged,
       delta ~ frequency + num_sylls_est + normalized_surprisal)
vif(m)
##            frequency        num_sylls_est normalized_surprisal 
##             1.131245             1.162525             1.029433
max(vif(m))
## [1] 1.162525

Supplementary 3: Zipf's law across languages

Here, we check whether the original finding (i.e., that more frequent wordforms are more ambiguous) is indeed significant across languages. We fit a Poisson regression predicting the number of homophones from frequency, number of syllables, and normalized surprisal.
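Because these are Poisson models, coefficients are on the log scale: \(e^{\beta}\) gives the multiplicative change in the expected homophone count per unit increase in the predictor. A small sketch of this reading, using the English frequency coefficient of roughly 0.67 reported in the output below:

```r
# Poisson GLM coefficients are log-scale: exp(beta) is the multiplicative
# effect on the expected count per one-unit increase in the predictor.
# Using the English frequency coefficient (~0.67) from the model output:
rate_ratio <- exp(0.67)
rate_ratio  # each unit of log frequency nearly doubles expected homophones
```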

m = glm(data = df_merged_english,
       num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
       family = poisson())
summary(m)
## 
## Call:
## glm(formula = num_homophones ~ frequency + num_sylls_est + normalized_surprisal, 
##     family = poisson(), data = df_merged_english)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4621  -0.5686  -0.3986  -0.2530   4.3877  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.29534    0.07295 -17.757   <2e-16 ***
## frequency             0.67082    0.01217  55.126   <2e-16 ***
## num_sylls_est        -0.70260    0.01972 -35.632   <2e-16 ***
## normalized_surprisal -0.02284    0.02419  -0.944    0.345    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 23170  on 28913  degrees of freedom
## Residual deviance: 15780  on 28910  degrees of freedom
## AIC: 26494
## 
## Number of Fisher Scoring iterations: 6
m = glm(data = df_merged_german,
       num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
       family = poisson())
summary(m)
## 
## Call:
## glm(formula = num_homophones ~ frequency + num_sylls_est + normalized_surprisal, 
##     family = poisson(), data = df_merged_german)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1118  -0.2156  -0.1563  -0.1212   5.2721  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -3.35319    0.16488 -20.337   <2e-16 ***
## frequency             0.88212    0.02730  32.308   <2e-16 ***
## num_sylls_est        -0.53413    0.03763 -14.196   <2e-16 ***
## normalized_surprisal  0.06701    0.05735   1.168    0.243    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 9008.1  on 46719  degrees of freedom
## Residual deviance: 7126.7  on 46716  degrees of freedom
## AIC: 9402.1
## 
## Number of Fisher Scoring iterations: 7
m = glm(data = df_merged_dutch,
       num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
       family = poisson())
summary(m)
## 
## Call:
## glm(formula = num_homophones ~ frequency + num_sylls_est + normalized_surprisal, 
##     family = poisson(), data = df_merged_dutch)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7835  -0.2248  -0.1559  -0.1168   4.8731  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.10770    0.11126 -18.945  < 2e-16 ***
## frequency             0.63707    0.01660  38.384  < 2e-16 ***
## num_sylls_est        -0.99196    0.03195 -31.047  < 2e-16 ***
## normalized_surprisal  0.12656    0.03544   3.571 0.000355 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 15734  on 63363  degrees of freedom
## Residual deviance: 11000  on 63360  degrees of freedom
## AIC: 14952
## 
## Number of Fisher Scoring iterations: 7
m = glm(data = df_japanese_merged,
       num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
       family = poisson())
summary(m)
## 
## Call:
## glm(formula = num_homophones ~ frequency + num_sylls_est + normalized_surprisal, 
##     family = poisson(), data = df_japanese_merged)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.0122  -0.7595  -0.4818  -0.1831   8.8324  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.62451    0.05695   46.08   <2e-16 ***
## frequency             0.56094    0.01549   36.22   <2e-16 ***
## num_sylls_est        -1.02396    0.01248  -82.04   <2e-16 ***
## normalized_surprisal -0.51095    0.02442  -20.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 46530  on 40448  degrees of freedom
## Residual deviance: 32937  on 40445  degrees of freedom
## AIC: 47052
## 
## Number of Fisher Scoring iterations: 6
m = glm(data = df_french_merged,
       num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
       family = poisson())
summary(m)
## 
## Call:
## glm(formula = num_homophones ~ frequency + num_sylls_est + normalized_surprisal, 
##     family = poisson(), data = df_french_merged)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4353  -0.5689  -0.4168  -0.3199   5.5647  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.44129    0.07151 -34.141   <2e-16 ***
## frequency             0.75907    0.01220  62.200   <2e-16 ***
## num_sylls_est        -0.19130    0.01511 -12.661   <2e-16 ***
## normalized_surprisal  0.02898    0.02442   1.186    0.235    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 26447  on 37277  degrees of freedom
## Residual deviance: 21159  on 37274  degrees of freedom
## AIC: 32466
## 
## Number of Fisher Scoring iterations: 6
m = glm(data = df_mandarin_merged,
       num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
       family = poisson())
summary(m)
## 
## Call:
## glm(formula = num_homophones ~ frequency + num_sylls_est + normalized_surprisal, 
##     family = poisson(), data = df_mandarin_merged)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.1664  -0.3767  -0.2930  -0.0850   7.2235  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.40941    0.12936   18.62   <2e-16 ***
## frequency             0.54218    0.01597   33.96   <2e-16 ***
## num_sylls_est        -2.71580    0.04033  -67.34   <2e-16 ***
## normalized_surprisal -1.08918    0.06335  -17.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 27993  on 41008  degrees of freedom
## Residual deviance: 13684  on 41005  degrees of freedom
## AIC: 19636
## 
## Number of Fisher Scoring iterations: 6

Supplementary 4: Using SUBTLEX for German, Dutch, and English

English

First, we load and process the data.

df_english = read_data("../data/processed/english/reals/english_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   `Unnamed: 0` = col_double(),
##   index = col_double(),
##   Word = col_character(),
##   CobLog = col_double(),
##   CompCnt = col_double(),
##   PhonDISC = col_character(),
##   Class = col_character(),
##   SylCnt = col_double(),
##   num_phones = col_double(),
##   num_sylls_est = col_double(),
##   remove = col_logical(),
##   num_homophones = col_double(),
##   log_prob = col_double(),
##   surprisal = col_double(),
##   rank_num_homophones = col_double(),
##   neighborhood_size = col_double(),
##   heldout_log_prob = col_double(),
##   heldout_surprisal = col_double()
## )
nrow(df_english)
## [1] 35107
## Now normalize probabilities and compute expected homophones per wordform
df_english = df_english %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_english %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## # A tibble: 9 × 3
##   num_sylls_est total correct_total
##           <dbl> <dbl>         <dbl>
## 1             1  7706          7706
## 2             2 15247         15247
## 3             3 11379         11379
## 4             4  5316          5316
## 5             5  1694          1694
## 6             6   439           439
## 7             7    95            95
## 8             8    10            10
## 9            10     1             1

We then merge with frequency data.

df_subtlex = read_csv("../data/frequency/english/SUBTLEXusfrequencyabove1.csv")
## Parsed with column specification:
## cols(
##   Word = col_character(),
##   FREQcount = col_double(),
##   CDcount = col_double(),
##   FREQlow = col_double(),
##   Cdlow = col_double(),
##   SUBTLWF = col_double(),
##   Lg10WF = col_double(),
##   SUBTLCD = col_double(),
##   Lg10CD = col_double()
## )
nrow(df_subtlex)
## [1] 60384
df_subtlex$Word = tolower(df_subtlex$Word)

df_merged_english = df_english %>%
  inner_join(df_subtlex, by = "Word")
nrow(df_merged_english)
## [1] 24307
df_merged_english$frequency = df_merged_english$Lg10WF

Now we can visualize the outcome in a variety of ways:

df_merged_english %>%
  plot_comparison()

df_merged_english %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_merged_english %>% 
  run_regression()

output %>%
  plot_outcome()
## Warning: Unknown levels in `f`: log_neighborhood

output$model
## # A tibble: 4 × 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -6.59     0.425     -15.5  4.55e- 54
## 2 frequency              -0.655    0.0837     -7.82 5.29e- 15
## 3 num_sylls_est           0.674    0.0791      8.52 1.71e- 17
## 4 normalized_surprisal    3.92     0.154      25.4  1.95e-140
output$comparison
## # A tibble: 1 × 6
##   res.df      rss    df sumsq statistic  p.value
##    <dbl>    <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1  24303 2731590.     1 6882.      61.2 5.29e-15

Directly contrast the relationships:

df_merged_english %>%
  plot_contrast(bins = 20)

Dutch

First, we load and process the data.

df_dutch = read_data("../data/processed/dutch/reals/dutch_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Word = col_character(),
##   PhonStrsDISC = col_character(),
##   PhonCVBr = col_character(),
##   PhonSylBCLX = col_character(),
##   PhonStrsStDISC = col_character(),
##   PhonStCVBr = col_character(),
##   PhonSylStBCLX = col_character(),
##   PhonolCLX = col_character(),
##   PhonolCPA = col_character(),
##   PhonDISC = col_character(),
##   remove = col_logical()
## )
## See spec(...) for full column specifications.
nrow(df_dutch)
## [1] 65260
## Now normalize probabilities and compute expected homophones per wordform
df_dutch = df_dutch %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_dutch %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## # A tibble: 9 × 3
##   num_sylls_est total correct_total
##           <dbl> <dbl>         <dbl>
## 1             1  4672          4672
## 2             2 20588         20588
## 3             3 26420         26420
## 4             4 11338         11338
## 5             5  3394          3394
## 6             6   846           846
## 7             7   189           189
## 8             8    28            28
## 9             9     2             2

We then merge with frequency data.

df_subtlex = read_delim("../data/frequency/dutch/SUBTLEX-NL.txt", delim="\t")
## Parsed with column specification:
## cols(
##   Word = col_character(),
##   FREQcount = col_double(),
##   CDcount = col_double(),
##   FREQlow = col_double(),
##   CDlow = col_double(),
##   FREQlemma = col_double(),
##   SUBTLEXWF = col_double(),
##   Lg10WF = col_double(),
##   SUBTLEXCD = col_double(),
##   Lg10CD = col_double()
## )
nrow(df_subtlex)
## [1] 437503
df_subtlex$Word = tolower(df_subtlex$Word)

df_merged_dutch = df_dutch %>%
  inner_join(df_subtlex, by = "Word")
nrow(df_merged_dutch)
## [1] 28330
df_merged_dutch$frequency = df_merged_dutch$Lg10WF

Now we can visualize the outcome in a variety of ways:

df_merged_dutch %>%
  plot_comparison()

df_merged_dutch %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_merged_dutch %>% 
  run_regression()

output %>%
  plot_outcome()
## Warning: Unknown levels in `f`: log_neighborhood

output$model
## # A tibble: 4 × 5
##   term                 estimate std.error statistic  p.value
##   <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)            -6.64      0.834     -7.96 1.73e-15
## 2 frequency              -3.38      0.170    -19.8  4.90e-87
## 3 num_sylls_est           0.289     0.164      1.77 7.75e- 2
## 4 normalized_surprisal    6.58      0.328     20.1  7.51e-89
output$comparison
## # A tibble: 1 × 6
##   res.df       rss    df   sumsq statistic  p.value
##    <dbl>     <dbl> <dbl>   <dbl>     <dbl>    <dbl>
## 1  28326 15675501.     1 217909.      394. 4.90e-87

Directly contrast the relationships:

df_merged_dutch %>%
  plot_contrast(bins = 20)

German

df_german = read_data("../data/processed/german/reals/german_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   `Unnamed: 0` = col_double(),
##   index = col_double(),
##   Word = col_character(),
##   CompCnt = col_double(),
##   PhonDISC = col_character(),
##   SylCnt = col_double(),
##   num_phones = col_double(),
##   num_sylls_est = col_double(),
##   remove = col_logical(),
##   num_homophones = col_double(),
##   log_prob = col_double(),
##   surprisal = col_double(),
##   rank_num_homophones = col_double(),
##   neighborhood_size = col_double(),
##   heldout_log_prob = col_double(),
##   heldout_surprisal = col_double()
## )
nrow(df_german)
## [1] 50435
## Now normalize probabilities and compute expected homophones per wordform
df_german = df_german %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_german %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## # A tibble: 10 × 3
##    num_sylls_est total correct_total
##            <dbl> <dbl>         <dbl>
##  1             1  2454          2454
##  2             2 13181         13181
##  3             3 19429         19429
##  4             4 10983         10983
##  5             5  3971          3971
##  6             6  1240          1240
##  7             7   328           328
##  8             8    72            72
##  9             9    16            16
## 10            10     2             2

We then merge with frequency data.

df_freq = read_csv("../data/frequency/german/SUBTLEX-DE.csv")
## Parsed with column specification:
## cols(
##   Word = col_character(),
##   `Spell check` = col_double(),
##   FREQcount = col_double()
## )
nrow(df_freq)
## [1] 377524
# df_freq$Word = tolower(df_freq$Word)

# Log10 frequency (+ 1). Should be analogous to the Lg10WF measure used for English and Dutch.
df_freq$Lg10WF = log10(df_freq$FREQcount + 1)

df_merged_german = df_german %>%
  inner_join(df_freq, by = "Word")
nrow(df_merged_german)
## [1] 24350
df_merged_german$frequency = df_merged_german$Lg10WF

Now we can visualize the outcome in a variety of ways:

df_merged_german %>%
  plot_comparison()

df_merged_german %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_merged_german %>% 
  run_regression()

output %>%
  plot_outcome()
## Warning: Unknown levels in `f`: log_neighborhood

output$model
## # A tibble: 4 × 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -4.76     0.399     -12.0  7.99e- 33
## 2 frequency              -1.67     0.0888    -18.8  3.12e- 78
## 3 num_sylls_est           0.316    0.0731      4.32 1.55e-  5
## 4 normalized_surprisal    4.19     0.157      26.7  2.36e-154
output$comparison
## # A tibble: 1 × 6
##   res.df      rss    df  sumsq statistic  p.value
##    <dbl>    <dbl> <dbl>  <dbl>     <dbl>    <dbl>
## 1  24346 2750831.     1 39903.      353. 3.12e-78

Directly contrast the relationships:

df_merged_german %>%
  plot_contrast(bins = 20)